In this lab you’ll have the chance to go over a number of things that we covered before reading week. We will focus on data subsetting, data visualisation, and building some linear models.
We will be using two sets of packages - the tidyverse packages which includes dplyr, ggplot2 etc. and the gapminder package which includes some nice global data related to population size, life expectancy, GDP etc for a bunch of countries over a number of years.
First of all, load the tidyverse and gapmiinder packages to your library. Remember, if you haven’t installed either before, you’ll need to first type install.packages("packagename") in the console. Obviously, replace “packagename” with the name of the package you want to download!
library(tidyverse)
library(gapminder)
Frist we’re going to build a scatterplot of life expectancy against GDP per capita using the gapminder dataset. If you want to explore the dataset before you use it you could type >help(gapminder) or >str(gapminder)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
labs(title = "Scatterplot of Life Expectancy against GDP per capita",
x = "GDP per capita",
y = "Life Expectancy (years)")
Now let’s separate out the data by Contintent.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point() +
labs(title = "Scatterplot of Life Expectancy against GDP per capita by Continent",
x = "GDP per capita",
y = "Life Expectancy (years)",
colour = "Continent")
We can also use facet_wrap() to split by Continent which might make things a little easier to see.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, colour = continent)) +
geom_point() +
facet_wrap(~ continent) +
guides(colour = FALSE) +
labs(title = "Scatterplot of Life Expectancy against GDP per capita by Continent",
x = "GDP per capita",
y = "Life Expectancy (years)",
colour = "Continent")
Now we’re going to focus on just one country, Sweden - famous for pop pioneers, Abba, and everyone’s favourite progressive death metal band, Opeth. No? Just mine then…
We will use the filter() function from the dplyr package to include only data from “Sweden” in our new variable I’m calling gapminder_sub1.
gapminder_sub1 <- filter(gapminder, country == "Sweden")
Let’s plot a scatterplot of life expectancy against GDP and add a regression line.
ggplot(gapminder_sub1, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Sweden Life Expectancy by GDP",
x = "GDP per capita",
y = "Life Expectancy (years)")
We can formally calculate the linear model using the lm() function.
model <- lm(lifeExp ~ gdpPercap, data = gapminder_sub1)
summary(model)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gapminder_sub1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6476 -0.3361 0.0124 0.1841 1.1721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.853e+01 4.410e-01 155.38 < 2e-16 ***
## gdpPercap 3.834e-04 2.074e-05 18.49 4.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5311 on 10 degrees of freedom
## Multiple R-squared: 0.9716, Adjusted R-squared: 0.9687
## F-statistic: 341.9 on 1 and 10 DF, p-value: 4.618e-09
We can look at some diagnostic plots to make sure everything looks ok.
plot(model)
Generally things look fine although point 12 is just outside .5 of Cook’s distance. Now we’re going to do the same for the UK.
gapminder_sub2 <- filter(gapminder, country == "United Kingdom")
ggplot(gapminder_sub2, aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "UK Life Expectancy by GDP",
x = "GDP per capita",
y = "Life Expectancy (years)")
model <- lm(lifeExp ~ gdpPercap, data = gapminder_sub2)
summary(model)
##
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = gapminder_sub2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75546 -0.29292 -0.03036 0.18865 0.99229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.515e+01 4.231e-01 153.96 < 2e-16 ***
## gdpPercap 4.527e-04 2.051e-05 22.07 8.17e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5026 on 10 degrees of freedom
## Multiple R-squared: 0.9799, Adjusted R-squared: 0.9779
## F-statistic: 487.2 on 1 and 10 DF, p-value: 8.166e-10
plot(model)
In this case, point 12 is looking a little extreme.
Now, we’re going to combine our two dataframes using the rbind() function. This function will take two separate dataframes and bind them together by rows (i.e., one on top of each other) to produce a new dataframe. You can also combine two dataframes side by side by binding together by columns using the cbind() function. Here we’re sticking with rbind() and we’re mapping the output of this onto a new variable I’m calling data_combined.
data_combined <- rbind(gapminder_sub1, gapminder_sub2)
We’re now going to produce a different kind of plot. We’re going to build a violin plot showing the distribution of life expectancy data for our two countries. We’re also adding some descriptive statistics using the stat_summary() function - we’re adding the mean and SE bars.
ggplot(data_combined, aes(x = country, y = lifeExp, fill = country)) +
geom_violin() +
geom_jitter(width = .1, alpha = .3) +
stat_summary(fun.data = mean_se, geom = "errorbar", width=.25) +
stat_summary(fun.y = mean, geom = "point", size = 4) +
guides(fill = FALSE) +
labs(title = "Life Expectancy for Sweden and the United Kingdom",
x = "Country",
y = "Life Expectancy (years)")
Now let’s look at the countries in Europe using our filter() function again and life expectancy against GDP.
Instead of creating temporary variables, we can use the pipe operator %>% (read as “and then”) which allows us to filter the dataset before we pass it to ggplot(). In this case, we’re selecting only the Europe continent in our dataset.
gapminder %>%
filter(continent == "Europe") %>%
ggplot(aes(x = gdpPercap, y = lifeExp)) +
geom_point() +
labs(title = "Life Expectancy against GDP per capita \nfor countries in Europe",
x = "GDP per capita",
y = "Life Expectancy (years)")
Now we’re going to plot boxplots of the Life Expectacy data for all the countries in Europe. Notice how I’m using the fct_reorder() function to reorder the levels of our factor based on the median of each country’s Life Expectancy data.
gapminder %>%
filter(continent == "Europe") %>%
ggplot(aes(x = fct_reorder(country, lifeExp, median), y = lifeExp, fill = country)) +
geom_boxplot() +
coord_flip() +
guides(fill = FALSE) +
labs(title = "Boxplots of Life Expectancy in Countries in Europe over Time",
x = "Country",
y = "Life Expectancy (years)")
We’re now going to look at a dataset built into the tidyverse - it’s the diamonds dataset and contains the prices and other attributes of almost 54,000 diamonds. Let’s check the structure first using str()
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Let’s first plot Price against Clarity while also capturing Carat. Remember the alpha parameter in the geom_jitter() layer sets the level of translucency of each point - it can vary from 0 to 1. Change it from .1 to 1 to see what happens…
ggplot(diamonds, aes(x = clarity, y = price, colour = carat)) +
geom_jitter(alpha = .1) +
labs(x = "Clarity (Worst to Best)",
y = "Price in USD",
colour = "Carat")
Now let’s plot a histogram of diamond prices.
ggplot(diamonds, aes(x = price)) +
geom_histogram(bins = 50) +
labs(title = "Histogram of Diamond Prices",
x = "Price in USD")
We’re now going to look at histogrames of diamond prices separately for each type of diamond ‘cut’. Notice we’re also plotting in grey the histogram of the overall dataset. For the first geom_histogram() call we’ve using a filtered version of our dataset using the select() function to allow us to plot this background. The - means we’re dropping the variable cut from the dataset diamonds. The next geom_histogram() call uses the unfiltered dataset inherited from ealier in the pipe - we then facet_wrap() by the variable labelled cut.
diamonds %>%
ggplot(aes(x = price)) +
geom_histogram(data = select(diamonds, -cut), bins = 50, fill = "grey") +
geom_histogram(bins = 50) +
facet_wrap(~ cut) +
labs(title = "Histograms of Diamond Prices Faceted by Diamond Cut",
x = "Price in USD")
Now let’s look at the whole data set again with Price plotted against type of diamond ‘cut’. We’re also adding some descriptive statistics using the stat_summary() function.
diamonds %>%
ggplot(aes(x = cut, y = price, colour = carat)) +
geom_jitter(alpha = .1) +
stat_summary(fun.data = mean_cl_boot, colour="red") +
labs(x = "Diamond Cut", y = "Price in USD", colour="Carat")
Let’s work out some descriptives ourselves using group_by() and summarise() from the dplyr package, alongside a bit of piping %>%. We’re working out (grouped by cut) the mean and standard deviations for price and carat.
diamonds %>%
group_by(cut) %>%
summarise(mean_price = mean(price), sd_price = sd(price), count = n(), mean_carat = mean(carat),
sd_carat = sd(carat))
## # A tibble: 5 x 6
## cut mean_price sd_price count mean_carat sd_carat
## <ord> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Fair 4359. 3560. 1610 1.05 0.516
## 2 Good 3929. 3682. 4906 0.849 0.454
## 3 Very Good 3982. 3936. 12082 0.806 0.459
## 4 Premium 4584. 4349. 13791 0.892 0.515
## 5 Ideal 3458. 3808. 21551 0.703 0.433
Let’s choose a mid-ranking clarity level and smaller diamonds (< .5 carats). We use the filter() function to do that. Can a diamond’s price be predicted by its carat level?
diamonds %>%
filter(clarity == "VS2" & carat < .5) %>%
ggplot(aes(x = carat, y = price)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "Price in USD against Carats for \nMid-Range Clarity Small Diamonds",
x = "Carats",
y = "Price in USD")
The graph certainly suggests we can predict price if we know a diamond’s carat level. Is this supported by a linear model?
model <- lm(price ~ carat, data = filter(diamonds, clarity == "VS2" & carat < .5))
summary(model)
##
## Call:
## lm(formula = price ~ carat, data = filter(diamonds, clarity ==
## "VS2" & carat < 0.5))
##
## Residuals:
## Min 1Q Median 3Q Max
## -377.84 -104.68 -3.57 121.74 1016.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -155.47 16.07 -9.675 <2e-16 ***
## carat 2668.40 46.51 57.372 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 143.2 on 4084 degrees of freedom
## Multiple R-squared: 0.4463, Adjusted R-squared: 0.4461
## F-statistic: 3292 on 1 and 4084 DF, p-value: < 2.2e-16
plot(model)
Looks like we have a pretty nice fitting model with only one or two outliers.